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Abstract 

We develop a new robust stopping criterion in Partial Least Squares Re¬ 
gressions (PLSR) components construction characterised by a high level of 
stability. This new criterion is defined as a universal one since it is suitable 
both for PLSR and its extension to Generalized Linear Regressions (PLS- 
GLR). This criterion is based on a non-parametric bootstrap process and 
has to be computed algorithmically. It allows to test each successive com¬ 
ponents on a preset significant level a. In order to assess its performances 
and robustness with respect to different noise levels, we perform intensive 
datasets simulations, with a preset and known number of components to ex¬ 
tract, both in the case n > p (n being the number of subjects and p the 
number of original predictors), and for datasets with n < p. We then use 
t-tests to compare the performance of our approach to some others classical 
criteria. The property of robustness is particularly tested through resampling 
processes on a real allelotyping dataset. Our conclusion is that our criterion 
presents also better global predictive performances, both in the PLSR and 
PLSGLR (Logistic and Poisson) frameworks. 
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1. Introduction 

Modelling relations by performing usual linear regressions between a uni¬ 
variate response y = ( 2/1 ,... ,y n ) T G R nxl , (.) T representing the transpose, 
and highly correlated predictors X = (xi,..., x p ) G R nxp , with n the number 
of subjects and p the number of predictors, or on datasets including more 
predictors than subjects, is not suitable or even possible, ffowever, with 
the huge technological and computer science advances, providing consistent 
analysis of such kinds of datasets has become a major challenge, particu¬ 
larly in domains such as medicine, biology or chemistry. To deal with such 
datasets, statistical methods have been developed, especially the PLS Re¬ 
gression (PLSR) which was first introduced by Wold et al. (1983) and Wold 
et al. (1984) and described precisely by Hoskuldsson (1988) and Wold et al. 
(2001). 

PLSR consists in building K ^ rk(X) orthogonal “latent” variables 
Tk = (t 1 ,..., t k), also called components, in such a way that T; v - describes 
optimally the common information space between X and y. Thus, these 
components are build up as linear combinations of the original predictors 
vectors, in order to maximise Cov(y,t fc |T fc _!) so that: 

p 

t k = Xw l =Y^ w* jk Xj, l^k^K (1.1) 

3 =1 

where w£ = ( w\k *,... ,w p k*) T is the vector of predictors weights, depending 
on y, in the k th component (Wold et al., 2001). 

Let K be the number of components. The final regression model is thus 
the following one: 
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with e = (ei,..., e n ) T the n by 1 error vector e 8 , verifying E ( e |T k ) — 0 n , 
Var (e |TV) = of x Id n and (ci,..., Ck) the coefficients of regression of y on 

T k . 

This model is not a linear one since the hat matrix H K = T k (T£T*)-‘T£ 
does not only depend on X but also on y. Thus, in almost all cases: 
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E(y|X)/£ c * £ 


W jk Kj 


k= 1 


v J=1 


(1.5) 


An extension to Generalized Linear Regression models, noted PLSGLR, 
has been developed by Bastien et al. (2005), with the aim of taking into 
account the specific distribution of y. In this context, the regression model 
is the following one: 
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9 (0) Ck 

k= 1 
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( 1 . 6 ) 


with 6 the conditional expected value of y for a continuous distribution or the 
probability vector of a discrete law with a finite support. The link function 
g depends on the distribution of y. 

Since k ^ rk (X) and Corr (t^, tfc 2 ) = 0, V (/ci, A^) £ [1,..., A'] 2 , issues 
concerning the ill-conditioned matrix X are solved, so that usual linear regres¬ 
sions can be used to estimate the c*, parameters. However, the determination 
of the optimal number of components K, which is equal to the exact dimen¬ 
sion of the link between X and y, is crucial to obtain correct estimations of 
the original predictors coefficients. Indeed, concluding K\ < K leads to a 
loss of information so that links between some predictors and y will not be 
correctly modelled. Concluding K 2 > K involves that useless information in 
X will be used to model knowledge in y, which leads to overfitting. 

In this article, we will first remind the reader about the most used existing 
criteria and introduce our adaptation of the so called bootstrapping pairs to 
the PLSR and PLSGLR as a new stopping criterion in part 2. We will also 
describe the global algorithm we developed to compare these different crite¬ 
ria through intensive datasets simulations. The aim is to analyse results we 
obtain with these different criteria depending on some increasing noise levels 
we insert both in X, through an additional component, and in y by adding a 
supplementary centred variable. Indeed, it is a common situation in domains 
such as chemistry or medicine, especially in allclotyping datasets, leading to 
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important issues concerning the extraction of a reliable number of compo¬ 
nents. In part 3, we will analyse the results we obtain in the PLSR framework 
before analysing PLSGLR results for the Logistic Regressions (PLS-LR) and 
the Poisson Regression (PLS-PR) in part 4. Results are treated for both 
n > p and n < p cases. In part 5, we will treat a real allelotyping dataset by 
comparing the number of components extracted by the different applicable 
criteria. We will also compare the robustness of our new bootstrap-based 
criterion through resampling processes, by approximating the distribution of 
these extracted number of components. Finally, in part 6, we will discuss the 
observed advantages and defaults of each criterion. 

2. Criteria compared through simulations 

2.1. Existing criteria used for comparison 

• In PLSR: 

1. Q 2 . This criterion is obtained by Cross-Validation (CV). We de¬ 
cide to check results either by a leave-one-out CV, i.e. q — n with 
q the number of parts the dataset is divided into, and with q cho¬ 
sen equal to five (5-CV) according to results obtained by Hastie 
et al. (2009) and Kohavi (1995). For each new component t*,, the 
following value is calculated: 

r>2 _ 1 _ PRESSk 
k L R.SS k - L > 

with RSSk-i the Residual Sum of Squares of the k — 1 components 
model and PRESSk the PREdictive residual Sum of Squares, cal¬ 
culated by CV, of the k components model. Tenenhaus (1998) 
considers that a new component t*. improve significantly the pre¬ 
diction of y if: 

a /PRESSk < 0.95 ^/RSS^Ti <*=> Q\ > 0.0975. 

Results linked to these two criteria will be graphically and respec¬ 
tively reported as Q21vlo and Q2K5. 

2. BICdof. As mentionned in the introduction, the hat matrix H k 
does not only depend on X, but also on y, so that corrected 
degrees of freedom ( dof ) have to be used in the expression of 
the BIC criterion (Schwarz, 1978). Kramer and Sugiyama (2011) 
define this dof correction in the PLSR framework (without missing 
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data) and apply it to the BIC criterion. We used the R package 
plsdof, based on Kramer and Sugiyama (2011) work, where this 
criterion is implemented as follow: 

BICdof = RSS/n + log(n)(j/n)a e 2 . 


where 7 is the degree of freedom ( dof ) of the model (1.4) and a 2 
is defined by Kramer and Sugiyama (2011). 

The selected model is the one which realize the first local mini¬ 
mum of this BICdof criterion. We also extract results by retaining 
models realizing the global minimum and will refer to them as 

BICglob. 

• In PLSGLR: 

1. AIC. The AIC criterion (Akaike, 1974) can be computed, what¬ 
ever is the distribution involved. However, no corrected dof have 
been determined in this PLSGLR framework. 

2. BIC. As for the AIC, the BIC is calculable without corrected dof. 

3. CV — MClassed. This criterion could only be used for PLS-LR. 
Through a 5-CV, it determines for each model the number of 
predicted miss-classed values. The selected model is the one linked 
to the minimal value of this criterion. 

4. p val. Bastien et al. (2005) define a new component t*, as non¬ 
significant if there is not any significant predictors within it. An 
asymptotic Wald test is used to conclude to the significance of the 
different predictors. 

2.2. Bootstrap based criterion 

2.2.1. Motivations, assumption and definitions 

All the criteria described just above have major flaws including arbitrary 
bounds dependency, results based on asymptotic laws or derived from q -CV 
which naturally depends on the value of q and on the way the group will be 
randomly composed. 

For this purpose, we adapted a non-parametric bootstrap technique in order 
to test directly, with some confidence level (1 — a), the significance of the 
different coefficients c& by extracting confidence intervals (Cl) for each of 
them. 
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By maximizing Cov (y, IT^), each new component t k represents the 

best choice of a new dimensional space direction, extracted from the remain¬ 
ing information, explaining at best y — E (y |Tfc_i). This process implies the 
following result. 

Proposition 2.1. Let y^-i) and X^.q respectively be the deflated response 
vector and predictors jnatrix on k — 1 components. Suppose that, 

Vk G [I ,K],3i G [1,p 1, x^ (fc _ 1) y ( fe_ 1) ^ 0, 

then the PLS components building process implies that: 

Wk e [1,1E], Cfc > 0 and, conditionally to X, c k is linked to a positive 
distribution. 

Bootstrapping pairs (y, X) in order to test the significance of these pa¬ 
rameters is not suitable since it will approach the conditional distribution 
given X. Thus, to test each new component, we approach the conditional 
distribution of these coefficients given T/ ;; by bootstrapping pairs (y,T/,). 

We define the significance of a new component as resulting from its sig¬ 
nificance for both y and X, so that the extracted number of component k is 
defined as the last one which is significant for both of them. 

2.2.2. Adapted bootstrapping pairs as a new stopping criterion 

The so-called bootstrapping pairs was introduced by Freedman (1981). 
This technique only relies on the assumption that the originals pairs (?/*, t i# ), 
where t,;. represents the i th row of T*,, are randomly sampled from some 
unknown (k + 1)-dimensional distribution. This technique was developed to 
treat the so-called correlation models, in which predictors are considered as 
random and e may be related to them. Thus, it is appropriate to “estimate 
the regression plane for a certain population on the basis of a simple random 
sample” (Freedman, 1981, p.1219). 

Based on our definition of the significance of a new component, we adapted 
this method by designing a double bootstrapping pairs algorithmic imple¬ 
mentation. The first step consists in bootstrapping pairs (X. T fc ), leading to 
a maximal number k max of components which can be extracted. The second 
step consists in bootstrapping pairs (y,T/,) to test the significance of each 
successive component t k , with k ^ k max - To avoid confusions between the 
number of predictors and the coefficients of the regressions of X on T/,., we 
set m as the total number of predictors. 

The algorithm of this double bootstrapping pairs implementation is designed 
as follow : 
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Bootstrapping (X. T fc ): let k — 1 and l — 1,... ,m. 


1. Compute the k first components (t 1? . .., t k ). 

2. Bootstrap pair (X,T fc ), returning R bootstrap samples: 

(X,T k ) b \...,(X,T k ) bR . 

3. For each (X, T/, : ) fer , do m Ordinary Least Squares (OLS) regres¬ 
sions: 

x' r =E (Ph br 4)+4- 

3 =1 

4. Wpik, construct a (100 x (1 — a))% bilateral BC a Cl, noted: 

ci, = [cif;„ciy. 

5. If 3/ G {1,..., m}, 0 ^ Cl;, then k — k + 1 and return to step 1. 
Else, k max k 1. 

• Bootstrapping (y, T/,): let k — 1. Note that for PLSGLR, a generalized 
regression is performed at step 3. 

1. Compute the k first components (ti,..., t^). 

2. Bootstrap pair (y. T/, : ), returning R bootstrap samples: 

(y,T t )\...,(y,T,y». 

3. For each pair (y. T\.) 6r , do the OLS regression: 

y br =t («yq)+4'. 

3=1 

4. Since c k > 0, construct a (100 x (1 — a))% unilateral BC a CI: 

Cl = [ClJ, +oo [ for Cfc. 

5. While CI^ > 0 and k ^ k max , do k = k + 1, and return to step 
1. Else, the final extracted number of components is K = k — 1. 

Results linked to this bootstrap-based criterion will be graphically re¬ 
ported as BootYT. 
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2.2.3. Illustration of CV issues: first applications on real datasets 

As mentioned by Boulesteix (2014), important issues concerning the sta¬ 
bility of the g-fold CV procedure for the choice of tuning parameters, here 
the number of components, have been observed. These issues are directly 
induced by the value of q and by the random character of this resampling- 
based procedure while splitting the original dataset into two distinct sets, a 
training one and a test one. To illustrate consequences on the tuning param¬ 
eter, we treated two real datasets. 

The first dataset was collected on patients carrying a colon adenocarcinoma. 
It has 104 observations on 33 binary qualitative explanatory variables and 
one binary response variable representing the cancer stage according to the 
Astler-Collcr (AB vs CD) classification (Astler and Coller, 1954). This bi¬ 
nary response lead us to perform PLS-Logistic regressions. This dataset, 
named aze_compl, is available in the R package plsRglm (Bertrand et al., 
2014). 

We performed a 100 times the determination of the number of components 
using the CV-MClassed criterion with q £ {3,5,10,15,30}. Then, we per¬ 
formed the same process by using our new bootstrap-based criterion. Results 
are reported in Fig.l. 
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Figure 1: Extracted numbers of component using g-folds CV-MClassed (Left) 
and BootYT (Right) criteria 

Results obtained through g-fold CV, with q n, and displayed on Fig.l 
are typical examples of these issues. Depending on the choice of q and on the 
way the different fold are split, the extracted number of components could 


























be dramatically different. In addition, obtaining a complete distribution of 
the number of components is mostly impossible, due to the high number of 
different possibilities for splitting the original datasets into q groups. 

Proposition 2.2. Let n = pq + r, 0 ^ r ^ q — 1 be the euclidean division of 
n by q. Then, the number of distinct partitions of the original dataset into 
r (p + 1 )-element subsets and (q — r) p-element subsets for a CV does not 
depend on the order of their placement, and is equal to: 



( 2 . 1 ) 


The leave-one-out CV, which is the only complete CV (f(n,n) = 1, ie 
only one possibility for choosing n folds out of n subjects), concludes to one 
component to extract. However, it suffers from variance issues concerning the 
bias-variance tradeoff on the estimation of the prediction error (Hastie et al., 
2009) (Kohavi, 1995). Our new bootstrap-based criterion is more stable on 
this dataset and allows the user to choose the number of components, in this 
case three, through a more accurate process. 

The second example is a benchmark dataset, called “Processionnaire du 
Pin”, which is treated in depth by Tenenhaus (1998). It has 33 observation on 
10 explanatory variables and is also available in the R package plsRglm under 
the name pine. More details on this dataset are available in (Tenenhaus, 


1998). 


The same process was applied on this second practical case, with usual PLS 
regressions. Thus, we compare the Q 2 criterion obtained through g-fold CV, 
q G {2, 3, 5,10,15}, and our new bootstrap-based criterion. The Q 2 criterion 
obtained through leave-one-out CV concludes to one significant component. 
Others results are displayed in Fig.2. 

In this case, the g-folds CV does not suffer from stability issues, as those 
seen just before, since the Q 2 criterion is much more stable than the min¬ 
imisation of the number of miss-classed values. However, extracting one 
component is not recommended. Tenenhaus (1998), after a complete anal¬ 
ysis of this dataset, proves that four components is the best decision. This 
under-estimating issue linked to the Q 2 criterion will be verified and con¬ 
firmed in part 3.1, by treating results obtained through datasets simulations. 
Thus, while the Q 2 criterion under-estimates this optimal number of com¬ 
ponents, our new bootstrap-based criterion concludes to four components in 
more than 80% of results. 
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BootYT 



Figure 2: Extracted numbers of component using g-folds Q 2 (Left) and 
BootYT (Right) criteria 


2.3. Simulation plan 

To compare these different criteria, datasets simulations have been per¬ 
formed by adapting the sirnuLdata_ Uni YX function, available in the R pack¬ 
age plsRglm. 

Simulations were performed to obtain a three dimensions common space 
between X and y, leading to an optimal number of component equal to 
three. These three components own standard deviations respectively equal to 
(j i = 10, cr 2 = 8 and a 3 = 6 . Simulations were performed under two different 
cases, both in PLSR and PLSGLR framework. The first one is the n > p 
situation with n = 200 and p G f^oo = [7, 50]. The second one is the n < p 
situation where n = 20 and p G f2 20 = [25, 50]. For each fixed couple (oq, cr 5 ), 
which respectively represents the standard deviation owned by the useless 
fourth component present in X and the random noise standard deviation in 
y, we simulated 100 datasets with pi predictors, 1 = 1 ,..., 100 , obtained by 
sampling with replacement in f2 n .The X matrices are so built around four 
orthogonal components. Finally, the number of bootstrap samples was fixed 
to R = 500. 

3. PLSR results 

As mentioned in part 2.3, the optimal number of components linked to 
these simulated datasets is equal to three. A four components model will thus 
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be too complex, even if the estimated response will not significantly change 
since Corr(y,t 4 ) ~ 0. Indeed, this fourth component only improves the 
representation of the original predictors by modelling useless informations 
in X, but is not helpful for a better estimation of y. All supplementary 
component is built up with some random noise present in X in order to 
explain remaining knowledge in y. 

3.1. PLSR: Case n > p 

For this framework, we consider the following values for noises standard 
deviations: 

, f a 4 e {0.01, 0.21,..., 5.81} U {6.01, 7.01,..., 30.01} 

: { a 5 e {0.01, 0.51,..., 20.01} 

These sets of values lead so to 2255 different couples (< 74 , < 75 ). Results are 
so stored in five tables (one per criterion) of dimension 2255 x 100. Columns 
correspond to the 100 datasets simulated per couple. 

In order to perform a first selection among the existing criteria, we focus 
on results obtained with: 

f C7 4 e {0.01,0.21,...,5.81} 

1 J ' \ a 5 e {0.01,0.51,...,20.01} 

We extract row means for each studied criteria (except for BootYT), and 
report them in Fig.3 and 4 as functions of (74 and (7,5. 




Figure 3: n > p, case (R); Left: BICdof row means; Right: BICglob row 
means 

Based on results mapped out in Fig.3, the BICglob criterion has some 
defaults concerning the stability of its results. This issue is mainly due to 
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the non necessarily increase of the adapted dof as a function of the number 
of components. As a direct consequence, adding a component sometimes 
lead to smaller dof than the previous model ones which leads this criterion 
to overestimating issues with huge variability levels. Thus, the BICglob 
criterion has to be avoided. 

Extracting the optimal number of components K as the one which is 
linked to the model realising the first local minimum of this adapted criterion, 
permits to focus the comparison on the first models, which are mainly linked 
to increasing dof. Thus, issues linked to the BICglob criterion are solved 
and results displayed in Fig.3, concerning the BICdof criterion, matched 
to the expected values. The increase of the displayed response surface for 
the highest values of <74 will be treated while analysing the entire response 
dataset. 



Figure 4: n > p, case ( B ); Left: Q21vlo row means; Right: Q2K5 row means 

Based on results displayed in Fig.4, the influence of the value of q is neg¬ 
ligible on this Q 2 criterion so that we will only discuss about results linked to 
the Q2K5 criterion, as recommended by Hastie et al. (2009). This criterion 
is very sensitive to the increasing noise level in y so that it globally un¬ 
derestimates the number of components. Tenenhaus (1998) mentioned this 
underestimating issue linked to this criterion while analysing the dataset in¬ 
troduced in part 2.2.3. The fact that this criterion performs an expected 
and optimal estimation of the number of component only for <75 ~ 0 is much 
more harmful for this benchmark criterion. Indeed, real datasets, especially 
in targets areas as mentioned during the introduction, are never noiseless. 

We then display three graphical representations of all the 2255 available 
row means linked to the Q2K5, BICdof and BootYT criteria in Fig.5. Each 
row variances were also extracted and graphically reported in Fig. 6 . 
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Figure 5: n > p, case (A); From left to right: Q2K5, BICdof and BootYT 
row means 



Q2K5 BICdof BootYT Q2K5 BICdof BootYT 


Figure 6 : n > p; Left: Boxplots of row variances, case (A); Right: Boxplots 
of row variances for <74 ^ 15.01 


Considering the extracted number of components as a discriminant fac¬ 
tor, we conclude that the Q 2 criterion is the less efficient criterion by being 
the most sensitive one to the increasing value of er 5 so that it globally under¬ 
estimates the number of components. Comparing BICdof and BootYT, or 
advertising one of them is quite difficult in this large n case. BICdof has a low 
computational runtime and is the less sensitive one to the increasing value of 
< 7,5 by retaining 86.37% of results equal to three or four components. However, 
referring to Fig. 6 , the variability of results linked to the BICdof is globally 
higher than the one linked to our new bootstrap based criterion, especially 

on datasets with large values of <74 ^<74 > of = \/200 ~ 14.14^ . In this 

particular case, our new bootstrap-based criterion keeps its stability while 
the median of BICdof results, for instance, is more than tripled (0.25 to 0.79) 
compared to the one linked to all data. Moreover, the BootYT is more ro- 
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bust than the BICdof to the increasing noise level in X in terms of extracted 
number of components. 

3.2. PLSR: Case n < p 

3.2.1. Row means and variances analysis 

This small training sample size, allows us to consider high-dimensional 
settings and is interesting since usually least squares regression could not be 
performed. 

We consider <j 4 G {0.01,1.01,..., 6.01} and cr 5 G {0.01,0.51,..., 20.01} 
leading to 287 different couples (<j 4 , a 5 ). Results are so stored in three tables 
of dimension 287 x 100. Row means are represented as a function of cr 4 and 
< 7,5 in Fig. 7. Graphical representations of row variances were also performed 
in Fig. 8. 



nb_comp 



nb_comp 



Figure 7: n <p\ From left to right: Q2K5, BICdof and BootYT row means 


o 



Q2K5 BootYT 



0.01 3.01 6.01 9.01 12.51 16.01 19.51 

<*5 


Figure 8: n < p\ Left: Boxplots of row variances; Right: Evolution of row 
variances 

Referring to Fig. 7, the BICdof suffers from overestimating issues. More¬ 
over, based on results display on Fig.8, it returns results linked to out of 
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range values of variance compared to the two others criteria. These two is¬ 
sues, which are mainly due to the extraction of 1678 (5.847%) results equal 
to 19 components, added to its its lack of robustness to the increasing noise 
level cr 5 , lead us to conclude that this criterion should be avoid in this n < p 
framework. 

Our new boostrap-based criterion underestimates the number of compo¬ 
nents but own an attractive robustness to the increasing noise level in y so 
that it returns results between 1.2 and 2.2 in average. Moreover, it returns 
results with low variability for fixed couple (er 4 , 05 ), contrary to the BICdof 
criterion. The Q2K5 criterion has a comparable attractive feature of stability 
but is less robust to noise level in y than our new bootstrap based criterion, 
linking the Q2K5 to globally more important underestimating issues. 

So, by considering the number of extracted components as a discriminant 
factor, we conclude that the BootYT criterion is the best one to deal with 
these n < p datasets. 


3.2.2. MSE analysis 

We then assess the predictive performances of each of these three crite¬ 
ria (BootYT, BICdof and Q2K5). Thus, for each of the 28 700 simulated 
datasets in this case, we simulated 80 more observations as test points. We 
computed both training and testing Normalized Mean Square Error (NMSE). 
The normalisation was done by dividing the training or testing MSE of the 
obtained model with the corresponding MSE linked to the trivial one (con¬ 
stant model equal to the mean of the training data). Furthermore, as men¬ 
tioned by Kramer and Sugiyama (2011, p.702), “the large test sample size 
ensures a reliable estimation of the test error.” 

We treat these predictive results for each couple of values (cr 4 , cr 5 ) by test¬ 
ing the equalities of NMSE means through asymptotic f-tests with Welch- 
Satterthwaite dof approximation (Welch, 1947). All these tests were per¬ 
formed on level a = 0.05. Results of these t-tests are graphically reported in 
Fig.9. 

Concerning BootYT vs Q2K5, the Q2K5 has a better predictive ability 
for some very low values of cr 5 . This result is not surprising since, in this 
case, the Q2K5 criterion returns numbers of components closer to three than 
BootYT does (Fig. 7). However, tests results between the BICdof and the 
Q2K5 criterion are not concluding to a significant better predictive perfor¬ 
mance of the Q2K5 criterion for small values of <75 despite the BICdof globally 
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BootYT vs BICdof BICdof vs Q2K5 BootYT vs Q2K5 



Figure 9: f-tests results: BootYT better (red), BICdof better (purple), Q2K5 
better (blue), no significant difference (green) 


overestimates the number of components in this case (Fig. 7). In fact, due 
to the small values of <75 and despite the bad estimations of the number of 
components returned by the BICdof criterion by extracting more than three 
or four components, testing NMSE react in the same way than the training 
ones i.e. the higher the extracted number of components is, the lower the 
predictive NMSE are. This fact lead us to only focus on the extracted num¬ 
ber of components when < 7,5 ~ 0, leading the Q2K5 criterion to be the best 
one. 

Finally, in all others cases, the BootYT criterion returns models with, at 
least, comparable or better predictive performances than the two others. 

3.3. PL SR: Conclusion 

In the n > p case, the BootYT criterion offers a better robustness to noise 
in y than the Q2K5. It is also more robust to the increasing noise level in X 
than the BICdof, which moreover has some variance issues for high values of 
< 74 . We also conclude the BootYT criterion as a good compromise between 
the two others criteria, owning their advantages without their drawbacks. 
Concerning the n < p case, our bootstrap-based criterion is less sensitive 
than the others to the increasing noise level in y and is linked to low variance 
results, leading it to have better predictive performances on datasets with a 
non-negligible noise level in y. 

4. PLSGLR results 

In this section, we present results on the comparison of our bootstrap- 
based criterion with 4 other criteria (AIC, BIC, CV-MClassed and p_val, see 
part 2.1). Note that, in this framework, due to the specific distribution of 
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y and link-function g we chose, the increase of cr 5 does not lead to a linear 
increase of noise level in y, as it did in datasets simulations for PLSR analysis. 

4-1- PLS-LR results 

In this case, the increasing value of a 5 does not influence the extracted 
number of components as much as it did during the PLS analysis. Indeed, 
due to the used link function: 

inv.logit : R —> ]0,1[ 

1 

X I- > - -7-r 

1 + exp ( — x) 

the increasing value of 05 does not imply an increase of the response vec¬ 
tor variance, which belong to [0; 0.25], so that the decrease of significant 
components is not implied. 

Finally, the bijectivity of the inv.logit function insures the presence of 
three common components between X and y. 

4-1.1. PLS-LR: Case n > p 

In this case, we consider cr 4 G {0.01, 0.51,..., 9.51} and cr 5 G {0.01, 0.51,..., 15.51} 
leading to 640 different couples ( 04 , 05 ). Results are so stored in five tables of 
dimension 640 x 100. We graphically report CV-MClassed, p_val and BootYT 
row means as functions of 04 and 05 as well as boxplots of their row variances 
in Fig. 10. 



Figure 10: n > p\ Left: Row means surfaces; Right: Boxplots of row variances 

Based on these graphics, the CV-MClassed performs well in estimating 
the optimal number of components in average. However, this good property 
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has to be nuanced by the high variances linked to its results and which lead 
this criterion to be used with caution. The BootYT and p_val criteria return 
similar results in this n > p case. Both of them slightly underestimate the 
optimal number of components but with the advantage of low variances of 
their results. 

We also observed results obtained with the non-corrected AIC and BIC 
criteria and display these results in Fig.ll. 



Figure 11: n > p; From top to bottom: AIC row means surface. BIC row 
means surface. 

The non-corrected dof lead the AIC and BIC criteria to globally overes¬ 
timate the number of components. Thus, these criteria have to be avoided 
until the development of a dof correction in this PLSGLR framework and 
will not be considered in the n < p case. Note that we only compared AIC 
and BIC values linked to the fc-components models with k ^ 7 and retained 
the one which realize the minimum of the studied criterion. 


4-1.2. PLS-LR: Case n < p 

In this case, we consider cr 4 G {0.01, 0.51,..., 9.51} and a 5 G {0.01, 0.51,..., 9.51} 
leading to 400 different couples ( 04 , < 7 , 5 ). Results are so stored in three tables 
of dimension 400 x 100. We set the maximal value of <75 to 9.51, and not to 
15.51 as for the n > p case, in order to save computational runtime since an 
increasing value of <75 does not affect the choice of the number of extracted 
components. 

We graphically report row means as a function of <74 and <75 as well as 
boxplots of row variances in Fig. 12. 

The CV-MClassed criterion maintains the same property of well estimat¬ 
ing in average and issue of variability as in the n > p framework. Concerning 
the two others criteria, we observed a higher underestimating issue linked to 
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Figure 12: n < p; Left: Row means surfaces; Right: Boxplots of row variances 

the p_val criterion than for the BootYT one. Furthermore, they both had 
low variability in results they return. 

4-1.3. PLS-LR: MSE and miss-classed values analysis 

In order to test their predictive performances, we simulated 80 more ob¬ 
servations for each simulated datasets (40 000), and computed the testing 
NMSE linked to each models established by the three criteria. Furthermore, 
since the binary response obtained by the model is equal to 1 if the estimated 
response is over 0.5, 0 if not, returning higher NMSE does not necessarily 
lead to higher number of miss-classed values. Thus, we also computed the 
number of predictive miss-classed values ( M_classed ) for each of these three 


criteria. 


In order to compare their predictive performances, f-tests were computed 
for each fixed values of (cr 4 , a 5 ) so that we can observe precisely which cri¬ 
terion has better performances depending on the noise level in X and y. 
Results of these tests are graphically reported in Fig. 13. 

The bootstrap-based criterion is never less efficient than the others crite¬ 
ria. If there is globally no significant differences between bootstrapping pairs 
and the p_val criterion concerning the predictive NMSE, BootYT is better 
than this criterion concerning the predictive number of miss-classed values. 
Then, there is only few cases where bootstrapping pairs is significantly better 
than the CV-MClassed criterion concerning the predictive number of miss- 
classed values. But, concerning the predictive NMSE, the BootYT criterion 
is better than this last one by returning significant smallest NMSE values, 
especially for high values of <r 5 . 

The boostrap-based criterion is also the best one by having, at least, 
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Figure 13: t- tests results: BootYT better (red), CV-MCIassed better 
(turquoise), p_val better (yellow), no significant difference (green) 


similar predictive performances compared to the two others. 

4-1-4- PLS-LR: Conclusion 

Through these simulations, we can reasonably assume that the bootstrap- 
based criterion is globally more efficient than the others ones. I 11 the n > p 
case, it offers a similar stability compared to the p_val criterion. However, 
it globally underestimates the optimal number of components when the CV- 
MClassed criterion retains it on average but with high variabilities. Concern¬ 
ing the n < p case, the BootYT criterion has better predictive performances 
than the two others studied criteria in terms of predictive NMSE and predic¬ 
tive miss-classed values. It also keeps a quite low variability, which is really 
important for a future routine implementation. Finally, the AIC and BIC 
are clearly not adapted since the corrected dof are not established yet. 
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4-2. PLS-PR results 

4-2.1. PLS-PR: Row means analysis 

Results were stored in four tables of dimension 440 x 100 in the n > p 
case and 360 x 100 in the n < p framework. Each of the 240 first rows 
correspond, in both cases, to results for a fixed couples of values (<74,(75), 
with <74 G {0.01, 0.51,..., 9.51} and cr 5 G {0.01, 0.21,..., 2.21}. They so 
correspond to results linked to simulations with a noise parameter <75 globally 
lower than the available information standard deviation we approached and 
which is so approximatively equal to 1.727. 

In the n > p case, the 200 last rows correspond to results for fixed couples 
of values (<74, (75), with <74 G {0.01, 0.51,..., 9.51} and <75 G {2.51, 3.01,..., 7.01}. 
Concerning the n < p case, the last 120 rows correspond to results for fixed 
couples (<7 4 , cr 5 ), with ct 4 G {0.01, 0.51,..., 9.51} and a 5 G {2.51, 3.01,..., 5.01}. 

AIC, BIC, p_val and BootYT row means are displayed as functions of <74 
and <75 in Fig. 14. 



sigma5 5 sigmaS 


Figure 14: From top to bottom: AIC, BIC, p_val and BootYT row means 
surfaces; Left: n > p case; Right: n < p case 

Except for the bootstrap-based criterion, all the criteria return an in¬ 
creasing number of components as cr 5 increases. These results lead us to 
conclude that our new bootstrap-based stopping criterion is the only one 
which is adapted to a Poisson distribution. Based on these graphical repre¬ 
sentations, no additional analysis of these numbers of components was done. 
Moreover, we decided to only compare predictive performances of both p_val 
and BootYT criteria. 

4-2.2. PLS-PR: MSE analysis 

We extracted the training log(MSE) as well as the testing NMSE, based 
on 80 supplementary simulated data, in the n < p case and displayed their 
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means for each value of cr 5 in Fig. 15. 
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Figure 15: n < p; Left: Evolution of training log(MSE) means; Right: Evo¬ 
lution of testing NMSE means 

The global decrease of the log(MSE) linked to the p_val models confirms 
that this criterion will model the random noise in y. On the contrary, the 
bootstrap based criterion returns a regular increase of the log(MSE), which 
empirically proves that it better succeeds in separating the real common 
information from the random noise. The overfitting issue linked to the p_val 
criterion has some major consequences on its predictive ability and lead so 
to higher NMSE than the BootYT ones. Furthermore, the BootYT criterion 
is more robust than the p_val one to the increasing noise level in y. It 
returns consistent testing NMSE for datasets linked to value of a 5 lower 
than the common information standard deviation, i.e < 1.727, and even 
returns quite reasonable testing NMSE means up to er 5 = 2.51, while the 
p_val criterion returns out of range testing NMSE for cr 5 ^ 1.01. 

We then displayed the evolution of testing NMSE variances in Fig. 16. 



Figure 16: n < p\ Evolution of testing NMSE variances 

Results obtained by the bootstrap based criterion are linked to acceptable 
variances while <75 < 1.727, so that we can postulate that while the random 
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noise standard deviation is lower than the real common information one, this 
new criterion returns stable results. To the contrary, the p_val models are 
linked to high variability of their testing NMSE. This fact is due to their 
over-complexities we described in part 4.2.1. 

However, these out-of-range variances lead also to non-significant means 
differences while using f-tests on these datasets although means differences 
are notable, based on results displayed in Fig. 15. 

To obtain consistent t-tests outcomes to compare the predictive ability 
of these criteria, we had to use results we obtained in the n > p case. We 
so simulated 100 supplementaries subjects as testing sets. The evolution of 
means and variances of testing NMSE for both bootstrap based criterion and 
p_val criterion are reported in Fig. 17. 



Figure 17: n > p ; Left: Evolution of testing NMSE variances; Center: Evolu¬ 
tion of testing NMSE means on all data; Right: Evolution of testing NMSE 
means for < 7,5 ^ 2.51 

Based on these graphics, we have to notice that models build up with 
the bootstrap-based criterion are on average better than the trivial ones 
for 05 ^ 2.51 while p_val models become worse than these trivial ones for 
(75 > 1.81 (Fig.17). Thus, p_val fails to construct better models than the 
trivial ones when the noise level in y is higher than the common information 
standard deviation. Then, both criteria return low variances of their testing 
NMSE for <r 5 ^ 3.01 so that f-tests return consistent outcomes on this range 
of values. Results of these f-tests are displayed in Fig. 18. 

Based on these f-tests results, we conclude that to set up a predictive 
model on a dataset with significant random noise in it, our new boostrap 
based criterion is the one which should be advised. Note that non-significant 
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Figure 18: t-tests results: BootYT better (red), no significant difference 
(green) 


differences for <75 ^ 3.51 are due to the high increase of variances linked to 
the p_val results (see Fig. 17). 

4-2.3. PLS-PR: Conclusion 

In the case of response vector y linked to a Poisson distribution with a 
significant random noise in it, the bootstrap based criterion stands out as 
the only one which should be used. Indeed, others could be interpreted as 
increasing functions of < 75 , so that they will model the random noise in y, 
leading to overfitting issues. As a direct consequence, they return models 
with poor predictive abilities compared to the ones our new criterion build 
up. 

5. Application to a real dataset 

In this section, we focus on a allelotyping study. Our method is applied 
on a dataset that concerns 267 subjects with colon cancer. Measures were 
done on 33 microsatellites in search of an allelic imbalance that indicates an 
abnormal number of allele copies of a nearby gene of interest. The aim of 
the study was to find the microsatellites subsets that would best discriminate 
left and right colon tumors. Thus, the univariate response corresponds to the 
original localisation of a colon tumor, leading to a binary response y, taking 
value 0 (respectively 1 ) if the localisation was on the right colon (respectively 
on the left). More details on it are available in Weber et al. (2007). 

This dataset contains missing values, so that we did a preprocessing in order 
to complete it by using the R package mice. As y is a 0-1 response, we 
used the three following stopping criteria in components construction: our 
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new bootstrap-based criterion, the CV-MClassed and the p_val ones. The 
corresponding extracted numbers of components are respectively equal to 
6 , 5 and 4. The bootstrap-based criterion selects more components than 
the CV-MClassed one, which does not match to what we observed in the 
simulations (see Fig. 10). 

In fact, results displayed in Fig. 10 are obtained by computing the means of 
extracted number of components obtained on the 100 simulated datasets for 
each fixed couple (< 74 , er 5 ). As we mentioned in part 4.1, results based on the 
CV-MClassed criterion have a high variability. To take this variability into 
account, we computed the CV-MClassed criterion 100 times on our dataset, 
leading to the distribution of the extracted number of components reported 
in Fig. 19. Then, by computing the mean of the 100 values of extracted 
numbers of components, we obtained 7.99, a result similar to the simulation 
ones. Thus, our simulation results are validated on this real case study. We 
also perform the same process using our new bootstrap based criterion and 
report the results in Fig. 19. 


CV-MClassed 


fl 


1 2 3 4 5 


Number of components 


BootYT 



Number of components 


Figure 19: Distribution of the extracted number of components with the 
q = 5 CV-MClassed criterion (Left) and BootYT (Right) 

Based on distributions in Fig. 19, the major default of the CV-MClassed 
criterion, namely the dependency of the extracted number of components 
to the way the group will be randomly formed, is clearly showed out. Thus, 
performing a single CV to find the number of components using this criterion 
must be avoided. As expected, the BootYT criterion returns stable results 
and conclude, in almost 80% of cases, in 6 components. 

We also test the robustness of these three criteria through a bootstrap 
resampling process, with 100 boostrap iterations, and a jackknife one. These 
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two resampling methods leads to distributions of the extracted number of 
components linked to each of the three compared criteria. Results are graph¬ 
ically reported in Fig. 20. 


BootYT p_val CV-MCIassed 



p_val CV-MCIassed 


1 5 9 13 1 5 9 13 1 5 9 13 


Figure 20: Distribution of extracted number of components through boot¬ 
strap (Left) and jackknife (Right) resampling processes 

These results confirm the high resampling robustness of our new bootstrap- 
based criterion compared to the CV-MCIassed one. The p_val criterion owns 
a comparable robustness but, based on our simulation results, with a higher 
bias. Based on these results and our simulation ones, we can reasonably 
conclude that for this dataset, the optimal number of components is 6. 

6. Discussion 

We developed a new bootstrap based stopping criterion in PLS compo¬ 
nents construction, characterised by a high level of stability and robustness 
with respect to noise levels compared to others criteria usually used. 

This bootstrap based criterion has a shortcoming in that its computa¬ 
tional runtime is higher than the others since it requires (k x ({pi + 1) x R)) 
least squares regressions per dataset. A first improvement has already be 
done by developing a parallel processing for this criterion. Furthermore, the 
development of corrected dof in PLSGLR framework would also permit to 
develop a corrected BIG formulation in this framework. This corrected BIG 
could provide an interesting alternative to the boostrap-based criterion since 
it could save an important computational runtime conditionally to the fact 
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that it would have, at least, similar properties to those we conclude in part 


3.3. 


However, this new bootstrap-based criterion represents a reliable, con¬ 
sistent and robust stopping criterion in order to select the optimal number 
of PLS components. It can also be performed both for PLSR and PLSGLR 
frameworks and allows users to test the significance of a new component with 
a preset risk level a. 

In the n > p PLSR framework, our simulations confirm the BICdof as be¬ 
ing an appropriate and well designed criterion. However, our new bootstrap- 
based criterion is an appropriate alternative in the n < p case, since the 
BICdof criterion suffers from high variances and overestimating issues, es¬ 
pecially for models with low random noise levels in y. Furthermore, both 
BICdof and Q2K5 criteria are more sensitive than the bootstrap-based cri¬ 
terion to the increasing noise level in y in this case. 

Concerning the PLSGLR framework, our simulations results based on 
two specific distributions (Binomial and Poisson) lead to recommend this 
new bootstrap-based criterion. Indeed, in the PLS-LR case, we show that 
depending on the statistic we used (testing NMSE or predictive number of 
miss-classed values) to test its predictive ability, the bootstrap-based cri¬ 
terion is never significantly worse than both the CV-MClassed and p_val 
criteria. Concerning results we obtained for a response vector under a Pois¬ 
son distribution, the bootstrap-based criterion is the only one which returns 
consistent results by retaining a decreasing number of components while the 
random noise level in y increases. By adding to this fact the MSE analysis 
and the obtained t -tests results, we can reasonably advise to apply this new 
criterion in such a framework. 

Appendix A. Proof of Proposition 2.2 

Proof. Let n = pq + r be the euclidian division of n by q. Then, a par¬ 
tition of the dataset for CV is composed of r (p + l)-element subsets and 
(q — r ) p-element subsets, noted A k , 1 ^ k ^ q. We set q 0 = 0 and let 
E = {qi, ..., q r } , 1 ^ qj ^ q, Vj G [1, r] be a subset of {1,..., q} containing 
the ordered set of indices of the (p + l)-element subsets so that: 
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Let us first determine the number of distinct partitions of {1,..., n} which 
can be written as: 


{Ai,.. • ) A qi — ! ) A qi , t4 9i _|_i, . . . , Aq r — !, Aq r , Aq r + 1 , . . . , Aq } (A.l) 


To lighten the formulas, let us set u> = {qj G E \ qj — q 3 -\ ^ 1} and rrij = 
qj — qj-i — 1. Then, by knowing that, for any set containing n elements, the 
number of distinct p-element subsets of it that can be formed is given by (”), 
we obtain that the number of distinct partitions of the form (A.l) is equal 
to: 
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Each second factorial term in the denominator being equal to the numer¬ 
ator of the following product term, we obtain that: 
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Finally, since the order of parts creation within each of the two classes 
of subsets ((p+1) and p-element subsets) is not useful in distinguishing one 
partition from another in a CV framework, we had to divide our result by 
the number of distinct permutations it exists in each class, so: 



(A.2) 


This result is therefore not dependant on E. 
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